An introduction to flowmix

flowmix is a package that estimates a sparse mixture of regressions model using an expectation-maximization (EM) algorithm.

Model

Basic model. Consider response data \(y^{(t)}\) and covariate \(X^{(t)} \in \mathbb{R}^{p}\) observed over time \(t = 1,\cdots, T\).

In our setup, \(y^{(t)}\) is a collection of \(n_t\) \(d\)-variate data points \(y_i^{(t)}\). Because our main application is flow cytometry, which measures cell-level attributes in a fluid, we will call these data points particles, and call \(y^{(t)}\) cytograms.

Now, conditional on covariate \(X^{(t)}\) at time \(t\), each particle is modeled to come from a probabilistic mixture of \(K\) different \(d\)-variate Gaussians:

\[ y^{(t)}_i | X^{(t)} \sim \mathcal{N}_d \left(\mu_{kt}, \Sigma_k\right) \text{ with probability } \pi_{kt},\]

where \(\pi_{kt}\) is the \(k\)’th cluster’s relative abundance at time \(t\), and \(\mu_{kt}\) is the \(k\)’th cluster center at time \(t\).

For each \(k=1,\cdots, K\), cluster centers \(\mu_{kt} \in \mathbb{R}^d\) and cluster probabilities \(\pi_{kt}\) at time \(t\) are modeled as functions of \(X^{(t)}\):

\[\mu_{kt} = \beta_{0k} + \beta_k^T X^{(t)}\]

and

\[\pi_{kt} = \frac{\exp(\alpha_{0k} + {X^{(t)}}^T \alpha_k)}{\sum_{l=1}^K \exp(\alpha_{0l} + {X^{(t)}}^T \alpha_l)}\]

for regression coefficients \(\beta_{0k} \in \mathbb{R}^d\), \(\beta_{k} \in \mathbb{R}^{p \times d}\), \(\alpha_k \in \mathbb{R}^p\), and \(\alpha_{0k} \in \mathbb{R}\). The covariance \(\Sigma_k\) determines the shape of the \(k\)’th Gaussian cluster; it is assumed to be constant over time and is not driven by covariates.
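As a quick numerical illustration of the mixture-weight formula, here is a toy base-R computation with made-up coefficients (not package code):

```r
# Toy computation of pi_{kt} at one time point via the multinomial logit above.
K <- 3; p <- 2
set.seed(1)
alpha0 <- rnorm(K)                    # intercepts alpha_{0k}
alpha  <- matrix(rnorm(K * p), K, p)  # row k holds alpha_k
x      <- rnorm(p)                    # covariate vector X^{(t)}
eta    <- alpha0 + alpha %*% x        # K linear predictors
pi_t   <- exp(eta) / sum(exp(eta))    # softmax: nonnegative, sums to 1
```

Each entry of `pi_t` is one cluster's relative abundance \(\pi_{kt}\) at that time point.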

Sparse estimation. In practice, there are many covariates \(X^{(t)}\) that may in principle be predictive of \(y^{(t)}\). Also, the number of regression parameters is \((p+1)(d+1)K\), which can be large relative to the number of cytograms \(T\). Furthermore, we prefer models in which only a small number of parameters are nonzero. Therefore, we penalize the log-likelihood with lasso penalties on \(\alpha\) and \(\beta\).

The two regularization parameters \(\lambda_{\alpha}\) and \(\lambda_{\beta}\) govern the amount of sparsity in the regression parameters, and are estimated using cross-validation.

Limit cluster mean movement. Also, in our ocean application, each cell population has a limited range in optical properties, due to biological constraints. We incorporate this domain knowledge into the model by constraining the range of \(\mu_{k1}, \cdots, \mu_{kT}\) over time. Since \(\beta_k^TX^{(t)} = \mu_{kt} - \beta_{0k}\), limiting the size of \(\beta_k^TX^{(t)}\) is equivalent to limiting the deviation of the \(k\)’th cluster mean at all times \(t=1,\cdots,T\) away from the overall center \(\beta_{0k}\). Motivated by this, we add a constraint so that \(\|\beta_k^T X^{(t)}\|_2 \le r\) for some fixed radius value \(r>0\).

The constraint also plays an important role for model interpretability. We wish for the \(k\)’th mixture component to correspond to the same cell population over all time. When a cell population vanishes we would like \(\pi_{kt}\) to go to zero rather than for \(\mu_{kt}\) to move to an entirely different place in cytogram space.

Lastly, the model estimates are obtained using an expectation-maximization (EM) algorithm, which is optimized for speed using Rcpp and some careful matrix algebra.

For more details about the model and algorithm, please refer to the full paper: link.

Next, we use artificial and real data to demonstrate how to use the package. The main function is flowmix().

Examples

library(flowmix)
#> Warning: replacing previous import 'RcppArmadillo::fastLmPure' by 'RcppEigen::fastLmPure' when loading 'flowmix'
#> Warning: replacing previous import 'RcppArmadillo::fastLm' by 'RcppEigen::fastLm' when loading 'flowmix'
#> Warning: replacing previous import 'CVXR::id' by 'dplyr::id' when loading 'flowmix'
library(tidyverse)

Artificial data

First, generate data:
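The vignette's data-generating chunk is omitted from this rendering; a minimal stand-in with the same shapes (the sizes and the covariate effect below are purely illustrative, not the package's simulator) is:

```r
# Illustrative stand-in: T cytograms of 3-variate particles, plus covariates.
set.seed(0)
TT <- 100; p <- 5; d <- 3; nt <- 200
X <- matrix(rnorm(TT * p), TT, p)      # covariates, one row per time point
ylist <- lapply(seq_len(TT), function(tt) {
  mu <- c(X[tt, 1], 0, 0)              # one cluster mean tracks covariate 1
  matrix(rnorm(nt * d), nt, d) + rep(mu, each = nt)
})
```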

This produces three-dimensional cytograms ylist and covariates X.

The first cytogram \(y^{(1)}\) looks like this.
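One simple way to look at it, using base graphics (the vignette's own plotting chunk is omitted here):

```r
# Pairwise scatterplots of the first cytogram y^(1).
y1 <- ylist[[1]]
colnames(y1) <- paste0("dim", seq_len(ncol(y1)))
pairs(y1, pch = ".", main = "First cytogram")
```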

Especially if your data is a time series, it could be useful to plot the covariates \(X^{(t)}\) once across time \(t=1,\cdots, T\).
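For example, with the already-loaded tidyverse (assuming X is the \(T \times p\) covariate matrix):

```r
# One line per covariate, faceted, across time.
library(tidyverse)
as_tibble(X, .name_repair = "unique") %>%
  mutate(time = row_number()) %>%
  pivot_longer(-time, names_to = "covariate", values_to = "value") %>%
  ggplot(aes(time, value)) +
  geom_line() +
  facet_wrap(~ covariate, scales = "free_y")
```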

Now, we fit the model with fixed regularization parameters \(\lambda_\alpha=0.01\) and \(\lambda_\beta=0.01\), and \(K=10\) clusters.
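A fitting call consistent with the grid names used later in this vignette might look like this (the argument names numclust, mean_lambda, and prob_lambda are assumptions; check ?flowmix):

```r
library(flowmix)
# Fit with fixed regularization parameters (argument names assumed).
res <- flowmix(ylist = ylist, X = X, numclust = 10,
               mean_lambda = 0.01,   # lambda_beta
               prob_lambda = 0.01)   # lambda_alpha
```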

Internally, flowmix() repeats the estimation five times (the default) and returns the model with the best data fit among the five runs.

The cluster probabilities over time look like this:
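For instance, if the fitted object (called res here) exposes the \(T \times K\) matrix of \(\pi_{kt}\) as res$prob (an assumption; inspect names(res) on your fit):

```r
# One line per cluster: estimated pi_{kt} across time; res$prob is assumed
# to be the T x K matrix of cluster probabilities.
matplot(res$prob, type = "l", lty = 1,
        xlab = "time t", ylab = "cluster probability")
```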

Showing the model across time, in an animation (scatterplot_2d() being the main data plotter for 2-dimensional data):

Cross-validation

One EM algorithm = one job

Next, one “job” (using the function one_job(ialpha, ibeta, ifold, irep)) is to run the EM algorithm once, for:

  • the ialpha-th \(\lambda_\alpha\) value (out of prob_lambdas).
  • the ibeta-th \(\lambda_\beta\) value (out of mean_lambdas).
  • the ifold-th test fold out of the nfold=5 CV folds.
  • the irep-th repeat of the EM algorithm (nrep=10 in total)
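Over the whole grid, this amounts to nested loops (a structural sketch only; in practice one_job() will also need the data, fold assignments, and an output directory, and each iteration would be dispatched to a separate worker):

```r
# Structural sketch of the full CV sweep; one_job() is assumed to also
# receive the data and an output directory in a real run.
for (ialpha in seq_along(prob_lambdas)) {
  for (ibeta in seq_along(mean_lambdas)) {
    for (ifold in 1:5) {
      for (irep in 1:10) {
        one_job(ialpha, ibeta, ifold, irep)
      }
    }
  }
}
```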

After each job is run, the result is saved in a file named like [ialpha]-[ibeta]-[ifold]-[irep]-cvscore.Rdata.

The cross-validation is designed this way because (1) it is clearly and conveniently parallelizable, and (2) a large amount of computation is required for even a normal-sized job: a 5-fold cross-validation with 10 restarts over a 10 x 10 grid of regularization values amounts to 10 × 10 × 5 × 10 = 5,000 EM runs. Running each job and saving it to a separate file allows the user to parallelize, and to save and restart jobs conveniently.

Also, the nrep models estimated on the full data for any given ialpha and ibeta are obtained using one_job_refit() (again saving to files named [ialpha]-[ibeta]-[irep]-cvscore.Rdata):
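The looping structure is the same, minus the fold index (again only a sketch; the real calls will carry additional arguments such as the data and an output directory):

```r
# Refit on the full data, once per (ialpha, ibeta, irep); no fold index.
for (ialpha in seq_along(prob_lambdas)) {
  for (ibeta in seq_along(mean_lambdas)) {
    for (irep in 1:10) {
      one_job_refit(ialpha, ibeta, irep)
    }
  }
}
```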

As we’ve mentioned above, since all of this is clearly parallelizable, it’s recommended to use multiple computers or servers for the full cross-validation.

A single function

cv.flowmix() conducts cross-validation by wrapping most of the above into a single function.

(This code takes a long time to run, so it is recommended that you run it separately in a script; use the mc.cores option to run the jobs on multiple cores):
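A call might look like the following; the argument names other than mc.cores (e.g. destin for the output directory) are assumptions based on the file-saving behavior described above, so check ?cv.flowmix before use:

```r
library(flowmix)
# Assumed call pattern for cv.flowmix; argument names other than mc.cores
# are guesses (see ?cv.flowmix), and "cv-results" is an arbitrary directory.
cvres <- cv.flowmix(ylist = ylist, X = X,
                    numclust = 10,
                    mean_lambdas = mean_lambdas,   # grid of lambda_beta values
                    prob_lambdas = prob_lambdas,   # grid of lambda_alpha values
                    nfold = 5, nrep = 10,
                    destin = "cv-results",
                    mc.cores = 4)
```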

Then, the results are saved into separate files whose names follow these rules:

  • “1-1-1-1.Rdata” (following the pattern ialpha-ibeta-irep-ifold.Rdata) for each CV run.
  • “1-1-1-cvres.Rdata” for each model estimated on the full data.

After the cross-validation is finished, the results are summarized from these files, and optionally saved to a file summary.RDS (if save=TRUE):
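For instance, with a hypothetical summarizing helper (the function name cv_summary() and its arguments below are placeholders, not confirmed package API; the point is only that everything is reconstructed from the saved per-job files):

```r
# Hypothetical summarization step: read the per-job .Rdata files back in
# from the output directory and save the summary to summary.RDS.
cvsummary <- cv_summary(destin = "cv-results", save = TRUE)
```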

The final model chosen by cross-validation is this:

Binning data

If the data contains too many particles, we can reduce the size of ylist and instead deal with binned counts.

The resulting object countslist can additionally be supplied to flowmix().

Here is an example. make_grid(ylist, gridsize=30) makes an equally spaced grid of size 30 in each dimension, spanning the data range. Then, bin_many_cytograms() places the particles in ylist into these bins. The resulting object is a list containing the grid centers ybin_list and the per-bin counts counts_list.
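Putting those pieces together (the exact call signature of bin_many_cytograms() is assumed here; see its help page):

```r
library(flowmix)
# Build a 30-bin-per-dimension grid over the data range, then bin the particles.
grid <- make_grid(ylist, gridsize = 30)
obj  <- bin_many_cytograms(ylist, grid)   # assumed call pattern
ybin_list  <- obj$ybin_list               # occupied bin centers
countslist <- obj$counts_list             # particle counts per bin
```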

(Not used here, but optionally, you can upweight each particle, e.g. using the biomass of each particle.)

Real data

You can repeat the above code blocks with real data.

Now, we visualize the results as before.